Log-based Approaches to Characterizing and Diagnosing MapReduce Systems
نویسندگان
چکیده
MapReduce programs and systems are large-scale, highly distributed and parallel, consisting of many interdependent Map and Reduce tasks executing simultaneously on potentially large numbers of cluster nodes. They typically process large datasets and run for long durations. Thus, diagnosing failures in MapReduce programs is challenging due to their scale. This renders traditional time-based Service-Level Objectives ineffective. Hence, even detecting whether a MapReduce program is suffering from a performance problem is difficult. Tools for debugging and profiling traditional programs are not suitable for MapReduce programs, as they generate too much information at the scale of MapReduce programs, do not fully expose the distributed interdependencies, and do not expose information at the MapReduce level of abstraction. Hadoop, the open-source implementation of MapReduce, natively generates logs that record the system’s execution, with low overheads. From these logs, we can extract state-machine views of Hadoop’s execution, and we can synthesize these views to create a single unified, causal, distributed control-flow and data-flow view of MapReduce program behavior. This state-machine view enables us to diagnose problems in MapReduce systems. We can also generate visualizations of MapReduce programs in combinations of the time, space, and volume dimensions of their behavior that can aid users in reasoning about and debugging performance problems. We evaluate our diagnosis algorithm based on these state-machine views on synthetically injected faults on Hadoop clusters on Amazon’s EC2 infrastructure. Several examples illustrate how our visualization tools were used to optimize application performance on the production M45 Hadoop cluster.
منابع مشابه
Diagnosing Heterogeneous Hadoop Clusters
We present a data-driven approach for diagnosing performance issues in heterogeneous Hadoop clusters. Hadoop is a popular and extremely successful framework for horizontally scalable distributed computing over large data sets based on the MapReduce framework. In its current implementation, Hadoop assumes a homogeneous cluster of compute nodes. This assumption manifests in Hadoop’s scheduling al...
متن کاملIn-situ MapReduce for Log Processing
Log analytics are a bedrock component of running many of today’s Internet sites. Application and click logs form the basis for tracking and analyzing customer behaviors and preferences, and they form the basic inputs to ad-targeting algorithms. Logs are also critical for performance and security monitoring, debugging, and optimizing the large compute infrastructures that make up the compute “cl...
متن کاملA New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...
متن کاملMore Intervention Now!
Techniques for characterizing performance and diagnosing problems typically endeavor to minimize perturbation by measurements and data collection. We are making a call to do exactly the opposite. In order to characterize the behavior of a system and to perform root-cause analysis and answer what-if questions, we need to conduct active and systematic experiments on our systems, perhaps at the sa...
متن کاملDistributed Log Analysis on the Cloud Using MapReduce
Original scientific paper In this paper we describe our work on designing a web based, distributed data analysis system based on the popular MapReduce framework deployed on a small cloud; developed specifically for analyzing web server logs. The log analysis system consists of several cluster nodes, it splits the large log files on a distributed file system and quickly processes them using MapR...
متن کامل